INTRODUCTION

Currently, over 700 million people around the globe live in extreme poverty. Despite the progress made over the past decades in reducing this number, the COVID-19 pandemic forced at least an additional 97 million people back into poverty, making the issue more relevant than ever.

A portion of our data comes from Kiva, an international nonprofit organization founded in 2005 and based in San Francisco, with a mission to connect people through lending to alleviate poverty. Kiva serves 77 countries and has funded 1.67 billion USD worth of loans to entrepreneurs and students in underserved populations. For Kiva to set its investment priorities, inform lenders, and understand its target communities, knowing the level of poverty of each borrower is crucial. However, collecting individual-level information is time-consuming, labor-intensive, and expensive. An easy-to-use, inexpensive, and scalable way to estimate poverty would therefore help Kiva target its poverty-alleviation efforts.

With this goal in mind, we decided to explore two key questions: How does poverty vary by region, country, and specific area? Which socioeconomic, demographic, and environmental factors have the greatest predictive power for regionalized poverty? Our aim is first to better understand how poverty affects various regions, and then to build a poverty prediction model that could help Kiva achieve its goals more effectively and efficiently.

DATA

Our group found the dataset on Kaggle.com; it was compiled by data scientist Reuben Pereira in 2018. Pereira originally put this data together to support his attempt to build a machine learning algorithm that could predict the poverty level of any region in an impoverished country. He wanted the algorithm to help Kiva, which allows people to lend money via the internet to low-income students and entrepreneurs in underserved communities, find communities in need of investment. In this dataset, Pereira combined socioeconomic/demographic, climate, environmental, conflict, natural disaster, loan, and poverty data, which he obtained from Kiva and from the Global Multidimensional Poverty Index, which is sourced from the United Nations Development Programme. The dataset contains 63,464 observations of 50 variables, of which we focus on 21.

The demographic data that we focus on in this analysis include: population density (popDensity), the average number of people per square kilometer in the country, based on GPWv3 (Gridded Population of the World) from 1990-2015; travel time to closest city (TimeToCity), the number of hours it takes to travel to the nearest city, based on the road networks that existed in 2000; night light (AvgNightLight), the mean value of the long-term lights in night images from 1992-2010; and land classification (LandClassification), the land cover class based on MERIS FR images from 2000-2005.

The environmental data that we focus on include: MODIS leaf area index (Modis_LAI), the mean percent foliage cover of the 8-day MODIS LAI time series data from 2001-2012; organic carbon (soil_orgc), the gravimetric content of organic carbon in the fine earth fraction that is <2 mm (g/kg); water pH (soil_phaq), a measure of the acidity or alkalinity of the soil; clay total (soil_clay), the gravimetric content of <0.002 mm soil material in the fine earth fraction that is <2 mm (g/100g); and silt total (soil_silt), the gravimetric content of the 0.002 mm to Y mm fraction of the <2 mm soil material (g/100g).

For climate data, our group focused on: precipitation (precipitation), the mean monthly precipitation from 2003-2006 in millimeters, estimated from GSMaP (Global Satellite Mapping of Precipitation); elevation (Elevation), the elevation of the region in meters, taken from a Global Relief Model based on SRTM 30+ and ETOPO DEM at 1/120 arc-degrees from 2002-2010; evaporation (Evaporation), the long-term MODIS-estimated evapotranspiration from 2000-2012 in millimeters/year; and temperature (Temperature), the mean value of the 8-day MODIS day-time LST time series data from 2011-2012 in degrees Celsius.

For each region (World.region), we also focused on: the country (country) in which the region is located; the MPI of both the region and the country (MPI_Region & MPI_Country), the Global Multidimensional Poverty Index, which identifies multiple deprivations at the household and individual level in health, education, and standard of living; the percent poverty of the region's country (PercPoverty); deprivation intensity (DepvrIntensity), the sum of the weights associated with each indicator in which a person in the region is deprived; and longitude/latitude data (longitude & latitude), which give the coordinates of the country in which the region lies.

The following table is a preview of the variables that we focus on in our investigation:

##        country                World.region popDensity PercPoverty MPI_Region
## 1  Afghanistan                  South Asia  811.15833        72.1      0.437
## 2  Afghanistan                  South Asia  811.15833        72.1      0.437
## 3       Bhutan                  South Asia   87.66666         4.3      0.016
## 4       Brazil Latin America and Caribbean 5005.61475         2.9      0.011
## 5       Brazil Latin America and Caribbean 5005.61475         2.9      0.011
## 6       Brazil Latin America and Caribbean 5005.61475         2.9      0.011
## 7       Brazil Latin America and Caribbean 5005.61475         2.9      0.011
## 8       Brazil Latin America and Caribbean 5005.61475         2.9      0.011
## 9       Brazil Latin America and Caribbean 5005.61475         2.9      0.011
## 10      Brazil Latin America and Caribbean 5005.61475         2.9      0.011
##    precipitation TimeToCity LandClassification DepvrIntensity soil_phaq
## 1          120.5          3                200           60.7        NA
## 2          120.5          3                200           60.7        NA
## 3          885.0         10                 14           37.8        NA
## 4         1047.5         10                130           37.1        NA
## 5         1047.5         10                130           37.1        NA
## 6         1047.5         10                130           37.1        NA
## 7         1047.5         10                130           37.1        NA
## 8         1047.5         10                130           37.1        NA
## 9         1047.5         10                130           37.1        NA
## 10        1047.5         10                130           37.1        NA
##    Elevation MPI_Country Temperature Evaporation Modis_LAI AvgNightLight
## 1       1014       0.295          32         566         2            26
## 2       1014       0.295          32         566         2            26
## 3       2337       0.119          20        7776         5            22
## 4         10       0.021          29          NA        NA            62
## 5         10       0.021          29          NA        NA            62
## 6         10       0.021          29          NA        NA            62
## 7         10       0.021          29          NA        NA            62
## 8         10       0.021          29          NA        NA            62
## 9         10       0.021          29          NA        NA            62
## 10        10       0.021          29          NA        NA            62
##    soil_clay soil_silt soil_orgc
## 1         NA        NA        NA
## 2         NA        NA        NA
## 3         NA        NA        NA
## 4         NA        NA        NA
## 5         NA        NA        NA
## 6         NA        NA        NA
## 7         NA        NA        NA
## 8         NA        NA        NA
## 9         NA        NA        NA
## 10        NA        NA        NA
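
As a quick sanity check on how the poverty columns relate, the Global MPI is defined as the product of the share of people who are multidimensionally poor (H) and the average intensity of their deprivation (A). The sketch below assumes that PercPoverty plays the role of H and DepvrIntensity the role of A, both expressed in percent; that mapping is our reading of the preview rows, not something the dataset documentation confirms. The values are copied from the preview table above, and `mpi_from_parts` and `abs_diff` are illustrative column names.

```r
# Distinct rows copied from the preview table above.
preview <- data.frame(
  country        = c("Afghanistan", "Bhutan", "Brazil"),
  PercPoverty    = c(72.1, 4.3, 2.9),     # assumed to be H, in percent
  DepvrIntensity = c(60.7, 37.8, 37.1),   # assumed to be A, in percent
  MPI_Region     = c(0.437, 0.016, 0.011)
)

# MPI = H * A, with both factors converted from percent to proportions.
preview$mpi_from_parts <- (preview$PercPoverty / 100) * (preview$DepvrIntensity / 100)

# Gap between the reconstructed and reported MPI; small gaps are expected
# because the published inputs are rounded.
preview$abs_diff <- abs(preview$mpi_from_parts - preview$MPI_Region)

print(preview)
```

Under this assumption, the reconstructed values agree with MPI_Region to within about 0.001 for all three countries, which suggests that PercPoverty and DepvrIntensity together carry essentially the same information as MPI_Region.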

The following figure shows the distribution of the variable of interest, percent poverty (PercPoverty), for each of the world regions in the data set:

Focusing on these variables will allow us to determine which factors have the greatest predictive power for poverty in impoverished countries, and to investigate the relationship between poverty and environmental, social, demographic, conflict, and climate indicators.

RESULTS

##Question 1: How does poverty vary by geographic area?

We began our analysis by exploring the PercPoverty variable, which was ultimately chosen to serve as the response in our predictive model. We wanted to develop a deeper understanding of how this variable is distributed across countries and regions. We therefore constructed a map to visualize the level of poverty by country, and found extreme variation in levels, as was to be expected. The countries with the highest percent poverty levels were South Sudan, Chad, and Burkina Faso.

Further analysis determined that the regions with the highest levels of poverty were Sub-Saharan Africa and South Asia, and we became interested in how poverty varied within these regions. We discovered a large relative disparity between countries in these regions, which could be an important factor to consider when targeting certain areas. For instance, the mean poverty percentage in Swaziland is around 25%, whereas in Chad it is closer to 95%.

Importantly, some countries, such as Nigeria and Ethiopia, show rather extreme variation in their poverty levels, which indicates large local variation. A number of countries also have "outlying" impoverished regions. While we noted these, we decided not to remove them from our dataset before moving on to our model, as these "outlying points" may be areas of extreme poverty and are thus important to include.

Thus, we conclude that poverty varies heavily by region, with the most impoverished countries being South Sudan, Chad, and Burkina Faso and the most impoverished regions being South Asia and Sub-Saharan Africa. We also noted that there can be much variation within countries that have high levels of percent poverty.

##Question 2: How can we predict poverty?

We then turned to building a model to predict poverty, starting with a simple model that used a single predictor: population density. The linear relationship between poverty and population density was not well defined and provided no valuable insight, so we investigated the polynomial relationship between poverty and the logarithm of population density. To choose the degree, we plotted the R-squared value against the polynomial degree; degrees 3 and 8 were the apparent turning points, and each additional degree beyond them yielded a smaller improvement. Because of its larger R-squared value, we preferred degree 8. With this model, we observed that as population density increases, poverty first rises and then decreases, with the decline starting at a log population density of about 2.5.

## [1] "R-squared of degree 3 polynomial model:"
## [1] 0.08883367
## [1] "R-squared of degree 8 polynomial model:"
## [1] 0.1930201
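
The degree comparison described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code or data behind the numbers in this report: the data frame `toy`, its columns, and the rise-then-fall curve peaking near a log density of 2.5 are all assumptions chosen to mirror the description. Because each lower-degree polynomial is nested in the higher-degree ones, R-squared cannot decrease as the degree grows, so the sketch also reports adjusted R-squared, which penalizes extra terms.

```r
set.seed(42)

# Synthetic stand-in for the real data (assumption): poverty rises and then
# falls with log population density, peaking near log density = 2.5.
n <- 500
log_density <- runif(n, 0, 8)
poverty <- 40 + 12 * log_density - 2.4 * log_density^2 + rnorm(n, sd = 20)
toy <- data.frame(poverty = poverty, log_density = log_density)

# Fit one polynomial model per degree and collect both fit statistics.
degrees <- 1:10
fit_stats <- t(sapply(degrees, function(d) {
  fit <- lm(poverty ~ poly(log_density, d), data = toy)
  s <- summary(fit)
  c(degree = d, r_squared = s$r.squared, adj_r_squared = s$adj.r.squared)
}))

print(round(fit_stats, 4))

# Plain R-squared is non-decreasing in the degree for these nested models.
r_squared <- fit_stats[, "r_squared"]
```

A plot of `fit_stats[, "r_squared"]` against `degrees` corresponds to the R-squared plot mentioned in the text; if adjusted R-squared (or error on held-out data) flattens at a lower degree, that is some evidence the higher-degree terms are mostly fitting noise.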

Next, we decided to build a more complex model, using step-wise regression to select predictors. Forward, backward, and bidirectional selection all resulted in the same optimal model, which used the predictors precipitation, AvgNightLight, Elevation, Temperature, Evaporation, Modis_LAI, Modis_EVI, soil_orgc, and soil_phaq to predict poverty. Importantly, the soil-related variables in the dataset contain a large number of NA values, and removing the affected rows in order to perform step-wise selection eliminated a large share of the dataset. We therefore performed the same analysis again, this time excluding the soil predictors from the candidate set to reduce the number of removed rows. This model's chosen predictors were: precipitation, AvgNightLight, LandClassification, Temperature, Evaporation, Modis_LAI, Modis_EVI, Conflicts_total, and Conflicts_totalDeathsCivilians. Comparing the two, the first model has a higher R-squared, but it also has more potential issues with normality and is affected by an influential outlier (one lying outside the Cook's distance bounds). The second model does not have these issues, but it has a lower R-squared. Neither model is a strong predictor: the first explains only about half of the variability in PercPoverty (multiple R-squared 0.5035, adjusted 0.4539), and the second under 30% (multiple R-squared 0.2834).

## 
## Call:
## lm(formula = PercPoverty ~ precipitation + AvgNightLight + Elevation + 
##     Temperature + Evaporation + Modis_LAI + Modis_EVI + soil_orgc + 
##     soil_phaq, data = MPI_sub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.25 -12.99  -2.88  11.79  56.78 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.775e+01  3.692e+01   1.293 0.199161    
## precipitation  3.218e-03  6.063e-03   0.531 0.596829    
## AvgNightLight -6.173e-01  1.878e-01  -3.287 0.001446 ** 
## Elevation      8.608e-03  4.475e-03   1.924 0.057565 .  
## Temperature    2.832e+00  6.825e-01   4.150 7.54e-05 ***
## Evaporation   -2.688e-03  1.380e-03  -1.948 0.054475 .  
## Modis_LAI     -9.594e-02  3.626e-01  -0.265 0.791915    
## Modis_EVI      3.429e-04  5.736e-03   0.060 0.952460    
## soil_orgc     -2.417e-01  1.818e-01  -1.329 0.187148    
## soil_phaq     -1.077e+01  2.728e+00  -3.948 0.000156 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.13 on 90 degrees of freedom
##   (894 observations deleted due to missingness)
## Multiple R-squared:  0.5035, Adjusted R-squared:  0.4539 
## F-statistic: 10.14 on 9 and 90 DF,  p-value: 1.259e-10
## 
## Call:
## lm(formula = PercPoverty ~ precipitation + AvgNightLight + LandClassification + 
##     Temperature + Evaporation + Modis_LAI + Modis_EVI + Conflicts_total + 
##     Conflicts_totalDeathsCivilians, data = MPI_sub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60.21 -20.96  -0.13  20.06 100.69 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    33.0932617  7.7587874   4.265 2.27e-05 ***
## precipitation                   0.0107357  0.0020526   5.230 2.24e-07 ***
## AvgNightLight                  -0.8850583  0.0850304 -10.409  < 2e-16 ***
## LandClassification              0.0296410  0.0198609   1.492   0.1360    
## Temperature                     1.0199108  0.1996444   5.109 4.18e-07 ***
## Evaporation                    -0.0026128  0.0004966  -5.262 1.90e-07 ***
## Modis_LAI                       0.2555330  0.1247479   2.048   0.0409 *  
## Modis_EVI                      -0.0031710  0.0018583  -1.706   0.0884 .  
## Conflicts_total                 0.0966162  0.0520447   1.856   0.0638 .  
## Conflicts_totalDeathsCivilians -0.1113045  0.0651915  -1.707   0.0882 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.13 on 703 degrees of freedom
##   (281 observations deleted due to missingness)
## Multiple R-squared:  0.2834, Adjusted R-squared:  0.2742 
## F-statistic: 30.89 on 9 and 703 DF,  p-value: < 2.2e-16
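
The trade-off between keeping the soil predictors and keeping rows can be illustrated with a small sketch. It uses synthetic data and base R's `step()`; the data frame `toy`, the coefficients used to generate it, and the share of missing soil values are assumptions for illustration only, and the real analysis may have been set up differently.

```r
set.seed(1)

# Synthetic stand-in for the real data (assumption): three predictors, with
# the soil variable missing for most rows, as in the original dataset.
n <- 200
toy <- data.frame(
  AvgNightLight = runif(n, 0, 63),
  Temperature   = rnorm(n, mean = 25, sd = 5),
  soil_phaq     = rnorm(n, mean = 6.5, sd = 1)
)
toy$PercPoverty <- 60 - 0.6 * toy$AvgNightLight + 1.0 * toy$Temperature +
  rnorm(n, sd = 15)
toy$soil_phaq[sample(n, 150)] <- NA  # 150 of 200 soil values set to missing

# Step-wise selection compares models on the same rows, so incomplete rows
# are dropped first. Keeping the soil column costs most of the data.
with_soil    <- na.omit(toy)
without_soil <- na.omit(toy[, c("PercPoverty", "AvgNightLight", "Temperature")])
cat("rows with soil predictors:   ", nrow(with_soil), "\n")
cat("rows without soil predictors:", nrow(without_soil), "\n")

# Backward selection by AIC, starting from the full model on complete rows.
full_model <- lm(PercPoverty ~ ., data = with_soil)
selected   <- step(full_model, direction = "backward", trace = 0)
print(formula(selected))
```

In the two summaries above, the same effect shows up in the "observations deleted due to missingness" lines: 894 rows are dropped when the soil variables are included, against 281 when they are excluded.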

In the first model, the two most significant predictors (both with p < 0.001) were Temperature and soil_phaq, a measure of the acidity or alkalinity of the soil; AvgNightLight was also significant (p < 0.01). In the second model, more predictors have a p-value below 0.05: precipitation, AvgNightLight, Temperature, and Evaporation (all with p < 0.001), as well as Modis_LAI (p = 0.0409).

Because of the relatively low predictive power of our linear models, we decided to try a more sophisticated model type. After some research, we learned that LASSO regression can be useful for building poverty prediction models, so we attempted to build a final model using this method. We split the data into a training set and a testing set, and then built a LASSO model using the glmnet R package. Comparing the values predicted by the model fitted on the training set with the actual values in the test set allowed us to calculate an R-squared value. Unfortunately, this value (0.2455236) was no better than those of our linear models: it was worse than that of the first linear model, though comparable to that of the second.

## [1] "Lambda: "
## [1] 0.1174595

## [1] "RSQ: "
## [1] 0.2455236
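
The train/test workflow described above might look like the following sketch. It runs on synthetic data, and several choices are assumptions not stated in this report: the 80/20 split, choosing lambda by cross-validation with `cv.glmnet()`, and computing the test R-squared as one minus the ratio of residual to total sum of squares. The predictor matrix `x`, including the deliberately irrelevant `noise` column, is made up for illustration.

```r
library(glmnet)

set.seed(7)

# Synthetic stand-in for the real data (assumption).
n <- 400
x <- cbind(
  AvgNightLight = runif(n, 0, 63),
  Temperature   = rnorm(n, mean = 25, sd = 5),
  precipitation = runif(n, 0, 2000),
  noise         = rnorm(n)  # irrelevant predictor the LASSO may shrink to zero
)
y <- 60 - 0.8 * x[, "AvgNightLight"] + 1.0 * x[, "Temperature"] +
  rnorm(n, sd = 20)

# Assumed 80/20 split into training and test rows.
train_idx <- sample(n, size = 0.8 * n)

# alpha = 1 selects the LASSO penalty; lambda is chosen by cross-validation
# on the training rows only.
cv_fit <- cv.glmnet(x[train_idx, ], y[train_idx], alpha = 1)
cat("Lambda:", cv_fit$lambda.min, "\n")

# Predict the held-out rows and compare with the actual values.
pred   <- as.numeric(predict(cv_fit, newx = x[-train_idx, ], s = "lambda.min"))
actual <- y[-train_idx]
rsq    <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
cat("RSQ:", rsq, "\n")
```

Note that an R-squared computed on held-out data is not directly comparable to the in-sample R-squared reported by `summary(lm(...))`: the held-out value is usually lower because the model never saw those rows, so part of the gap between the LASSO and the first linear model may reflect the evaluation method rather than the model type.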

CONCLUSION
